
Conversation


@rnwang04 rnwang04 commented Sep 24, 2025

Tickets: CVS-173845

@github-actions github-actions bot added the category: continuous batching Continuous batching label Sep 24, 2025
@ceciliapeng2011 ceciliapeng2011 marked this pull request as draft September 24, 2025 05:32
@github-actions github-actions bot added the category: llm_bench Label for tool/llm_bench folder label Sep 26, 2025

@Wovchena Wovchena left a comment


I'll leave C++ review to @vshampor


@Wovchena Wovchena left a comment


Waiting for Vasily

@ceciliapeng2011 ceciliapeng2011 added this to the 2025.4 milestone Oct 14, 2025

@vshampor vshampor left a comment


See my previous comment

@github-actions github-actions bot added the category: WWB PR changes WWB label Oct 16, 2025
@Wovchena Wovchena requested a review from Copilot October 16, 2025 11:23

Copilot AI left a comment


Pull Request Overview

This PR updates the GPU block size configuration to support XAttention, which uses a larger block size (256) compared to the standard GPU block size (16). The changes enable detection of XAttention at runtime and configure the appropriate block size accordingly.

  • Adds XAttention detection logic based on cache dimensions
  • Introduces sparse attention configuration support in benchmarking tools
  • Refactors sparse attention setup into a reusable function

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/cpp/src/continuous_batching/cache_manager.hpp | Adds XAttention detection and sets the GPU block size to 256 when XAttention is enabled |
| tools/who_what_benchmark/whowhatbench/model_loaders.py | Extracts sparse attention configuration into a separate function and adds validation logic |
| tools/llm_bench/llm_bench_utils/ov_utils.py | Adds validation to prevent conflicting sparse attention configuration |
| tools/llm_bench/task/text_generation.py | Moves the GenerationConfig import outside the conditional block for broader scope |
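To make the block-size behavior concrete, here is a minimal Python sketch of the selection logic described above. It is illustrative only: the real implementation is the C++ code in src/cpp/src/continuous_batching/cache_manager.hpp, and the function name and exact detection criterion below are assumptions.

```python
# Hypothetical sketch of the block-size selection described above; names and the
# exact detection criterion are assumptions, not the actual cache_manager.hpp code.
GPU_BLOCK_SIZE_DEFAULT = 16      # legacy GPUs, or XAttention not enabled
GPU_BLOCK_SIZE_XATTENTION = 256  # XAttention on Xe2/Xe3 and newer architectures

def select_gpu_block_size(kv_cache_block_dim: int) -> int:
    """Infer whether XAttention is active from the per-block dimension of the
    compiled KV-cache shape and return the matching GPU block size."""
    if kv_cache_block_dim == GPU_BLOCK_SIZE_XATTENTION:
        return GPU_BLOCK_SIZE_XATTENTION  # XAttention path
    return GPU_BLOCK_SIZE_DEFAULT         # dense / TRISHAPE path
```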



ceciliapeng2011 commented Oct 17, 2025

> The test at
>
> `def test_cache_optimized_generation_is_similar_to_unoptimized(test_struct, apply_rotation, use_sparse_attention):`
>
> must be extended with the XAttention case,

Could do it. But how about a separate PR after CPU xattn is merged? Hope this won't be a blocker for GPU xattn.

> and/or additional tests must be added to demonstrate just what behaviour you had in mind for the case where the user just tries to switch from SparseAttentionMode::TRISHAPE to SparseAttentionMode::XATTENTION without changing his expectations about the block size of 16 that TRISHAPE used to work perfectly with.

If I understand correctly, "SparseAttentionMode" is set at compile time, when users create an LLMPipeline. Switching between TRISHAPE and XATTENTION will trigger the plugin to re-compile the model. Is there any problem with that?

> Also, I guess I'll have to ask again - why is it that only enabling xattention changes the GPU block size implicitly to 128? If it's such a performant block size anyway, why do you not change the GPU block size to 128 everywhere and avoid implicit block size changes of which the user is unaware?

That's a good question. I think you are referring to the GPU KV block size, which is 256. There are two KV block sizes: 256 for XAttention on Xe2/Xe3 and newer GPU architectures, and 16 for legacy GPUs or when XAttention is not used. Sergey once suggested changing the block size to 256 everywhere, and we have a phase-by-phase plan to do that. Considering execution priority and effort, this work will happen after the full XAttention functionality is productized. We've discussed this in the email thread.
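A minimal sketch of the compile-time behavior described here, using the openvino_genai Python API. The SchedulerConfig field names (use_sparse_attention, sparse_attention_config.mode) are assumptions based on this discussion rather than a verified API reference.

```python
# Hypothetical sketch: the sparse attention mode is fixed when the pipeline is
# constructed, so switching TRISHAPE -> XATTENTION means building a new pipeline
# (and re-compiling the model). Field names are assumptions for illustration.
import openvino_genai as ov_genai

scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.use_sparse_attention = True
scheduler_config.sparse_attention_config.mode = ov_genai.SparseAttentionMode.TRISHAPE

# Compiled with TRISHAPE; the GPU KV block size stays at 16.
pipe_trishape = ov_genai.LLMPipeline("model_dir", "GPU", scheduler_config=scheduler_config)

# Switching the mode requires constructing a new pipeline, which re-compiles the
# model; on Xe2/Xe3 the KV block size then becomes 256 for XAttention.
scheduler_config.sparse_attention_config.mode = ov_genai.SparseAttentionMode.XATTENTION
pipe_xattn = ov_genai.LLMPipeline("model_dir", "GPU", scheduler_config=scheduler_config)
```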


peterchen-intel commented Oct 17, 2025

> The test at
>
> `def test_cache_optimized_generation_is_similar_to_unoptimized(test_struct, apply_rotation, use_sparse_attention):`
>
> must be extended with the XAttention case,

> Could do it. But how about a separate PR after CPU xattn is merged? Hope this won't be a blocker for GPU xattn.

@vshampor I think we have to go this way, or else the test case will always fail. Created ticket CVS-175120 to track this.

github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Oct 23, 2025
### Details:
- *XAttention for FP16 KVCache as a preview feature*
- [x] to add unit tests
- [x] to disable XAttention for legacy platforms (XAttention kernels are implemented for Xe2/Xe3 with CM)
- [x] to streamline the process of xattention. Currently the kvcache shape is used to determine it. Maybe there is a better approach.
- [x] to add warning messages for unsupported cases: multiple subsequences, typo in the kvcache precision, etc.
- [ ] to remove the trivial converter nodes from the xattention_threshold Parameter to the PageAttention input.
- [x] to refactor the xattention kernel impls by reusing RT parameters instead of recomputing them.
- [x] to enable the U8 KVCache path (stretch goal)
- [x] WWB with long prompts

This PR should work along with openvinotoolkit/openvino.genai#2764.

### Tickets:
 - *CVS-173857*

---------

Signed-off-by: Zhai, Xuejun <[email protected]>
Co-authored-by: river.li <[email protected]>
Co-authored-by: Luo Cheng <[email protected]>
Co-authored-by: Li, Tingqian <[email protected]>
Co-authored-by: rnwang04 <[email protected]>
Co-authored-by: Wang Wangwang <[email protected]>
Co-authored-by: Chen Peter <[email protected]>
Co-authored-by: Zhai, Xuejun <[email protected]>
Co-authored-by: Luwei Zhou <[email protected]>
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Oct 26, 2025

Same as [PR32064](#32064), commit 14e57f9:

$ git fetch origin pull/32064/head:pr/32064
$ git fetch origin pull/32551/head:pr/32551
$ git diff pr/32551 pr/32064
(empty)

Opened to resolve the commit checks issue of [PR32064](#32064): one commit there (5201cdf) is signed by an unknown author.

https://docs.github.com/en/github/authenticating-to-github/managing-commit-signature-verification/about-commit-signature-verification

---------

Signed-off-by: Zhai, Xuejun <[email protected]>
Signed-off-by: Chen, Peter <[email protected]>
Co-authored-by: river.li <[email protected]>
Co-authored-by: ceciliapeng2011 <[email protected]>
Co-authored-by: Luo Cheng <[email protected]>
Co-authored-by: Li, Tingqian <[email protected]>
Co-authored-by: rnwang04 <[email protected]>
Co-authored-by: Zhai, Xuejun <[email protected]>
Co-authored-by: Luwei Zhou <[email protected]>
Co-authored-by: Wang Wangwang <[email protected]>
@peterchen-intel


@vshampor @ceciliapeng2011 Let's continue the discussion in CVS-175590 (GPU KV block size) and CVS-175120 (test XATTN with CPU)

@peterchen-intel peterchen-intel dismissed vshampor’s stale review October 26, 2025 11:52

Agreed in chat to continue merging this PR and follow up on the discussion.



def get_scheduler_config_genai(cb_config):
def configure_sparse_attention(scheduler_params, scheduler_config):

@peterchen-intel peterchen-intel Oct 26, 2025


#2895 is working toward the same purpose.
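For context, a minimal sketch of what a helper with this purpose might look like. This is hypothetical: parameter names and the validation rule are assumptions, not the actual llm_bench/WWB code.

```python
# Hypothetical helper for the benchmark tools: apply sparse-attention settings from
# user-supplied scheduler params and reject conflicting combinations. Names and the
# validation rule are assumptions, not the actual implementation.
import openvino_genai as ov_genai

def configure_sparse_attention(scheduler_params: dict, scheduler_config: ov_genai.SchedulerConfig):
    mode_name = scheduler_params.get("sparse_attention_mode")
    if mode_name is None:
        return scheduler_config  # nothing to configure

    if scheduler_params.get("use_sparse_attention") is False:
        # Conflicting request: a mode was given while sparse attention is disabled.
        raise ValueError("sparse_attention_mode is set but use_sparse_attention is False")

    scheduler_config.use_sparse_attention = True
    scheduler_config.sparse_attention_config.mode = getattr(ov_genai.SparseAttentionMode, mode_name.upper())
    return scheduler_config
```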
